OpenMP is a system independent set of compiler directives and library routines that arranges automatic parallel processing of shared memory data when more than one processor is available. This option is provided in the latest Microsoft C++ compilers. The benchmark executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code. A run time Affinity option is available to execute the benchmark on a selected single processor.
The benchmarks demonstrate a near doubling of performance, using dual core processors, when not limited by memory speed and when the source code is compatible with Single Instruction Multiple Data (SIMD) operation. All that is needed for the speed increase is an extra directive in the source code (implying parallelise this) and a compilation parameter. Later tests show up to four times faster speeds using a quad core processor.
Potential performance gains from hardware SIMD with SSE instructions are not realised, due to compiler limitations, and this enhances the comparative benefit of CUDA GPU parallel processing. On the other hand, the benchmark, compiled for 64 bit working, demonstrates significant speed improvement using the eight additional SSE registers that are available. It also appears that certain compiler optimisation options (like loop unrolling) cannot be implemented when using OpenMP.
The benchmarks identify three slightly different numeric results on tests using SSE, old i387 and CUDA floating point instructions. Results output has been revised to provide more detail.
Other benchmarks have been converted to run using OpenMP and are described in OpenMP Speeds.htm. Observations are that performance with smaller data arrays can be extremely poor, due to high startup overheads, and wrong numeric results can be produced with careless use of OpenMP directives.
The benchmarks can be downloaded via OpenMPMflops.zip. No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.
The OpenMP tests have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm and linux openmp benchmarks.htm, and download benchmark execution files, source code and compile and run instructions in linux_openmp.tar.gz. Using Windows, the file may download wrongly as linux_openmp.tar.tar, but is fine when renamed to linux_openmp.tar.gz.
See
GigaFLOPS Benchmarks.htm
for further details and results, including comparisons with MP MFLOPS, a threaded C version, CUDA MFLOPS, for GeForce graphics processors, and Qpar MFLOPS, where Qpar is Microsoft's proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via
gigaflops-benchmarks.zip.
OpenMP is a system independent set of compiler directives and library routines that arranges parallel processing of shared memory data. This option is provided in the latest Microsoft C++ compilers.
In this case, the 32 bit and 64 bit compiler versions used were from the free
Windows Driver Kit Version 7.0.0.
For OpenMP, Microsoft Visual C++ 2008
Redistributable Packages for x86 and x64
were also downloaded.
For comparison purposes, the OpenMP benchmarks execute the same functions as the CUDA tests - see
Benchmark Details.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 floating point operations per data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words.
All that is required to arrange for the code to be run on more than one CPU is a simple directive:
  #pragma omp parallel for
  for(i=0; i < n; i++)
      x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
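As an illustration, below is a minimal self contained sketch of the two operations per word test. This is not the benchmark source itself - the array size, constants and pass count are assumed values for the example:

  #include <stdio.h>
  #include <omp.h>

  #define WORDS  1000000              /* 4 byte single precision words */
  #define PASSES 250

  static float x[WORDS];

  int main(void)
  {
      int    i, pass;
      float  a = 0.1f, b = 0.999999f; /* example constants only */
      double start, secs;

      for (i = 0; i < WORDS; i++) x[i] = 0.5f;

      start = omp_get_wtime();
      for (pass = 0; pass < PASSES; pass++)
      {
          #pragma omp parallel for    /* divide iterations between CPUs */
          for (i = 0; i < WORDS; i++)
              x[i] = (x[i] + a) * b;  /* 1 add + 1 multiply per word */
      }
      secs = omp_get_wtime() - start;

      printf("%8.0f MFLOPS, first result %8.6f\n",
             2.0 * WORDS * PASSES / secs / 1000000.0, x[0]);
      return 0;
  }

Compiled with the /openmp parameter (or -fopenmp for GCC), the loop iterations are divided between the available processors; without it, the pragma is ignored and the program runs on one CPU.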
There are some issues with the Microsoft compilers that limit performance. Using SSE instructions, the hardware registers can each contain four data words, permitting, for example, four simultaneous adds - Single Instruction Multiple Data (SIMD) operation. The compilers appear to generate only single data instructions (SISD), operating on 32 bits out of the 128 bits provided.
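For comparison, below is a hand coded sketch of what full SIMD usage would look like on the two operations per word calculation, using SSE intrinsics. This is illustrative only, assuming the array is 16 byte aligned and n is a multiple of four:

  #include <xmmintrin.h>                        /* SSE intrinsics */

  void triad_simd(float *x, int n, float a, float b)
  {
      __m128 va = _mm_set1_ps(a);               /* a in all four lanes */
      __m128 vb = _mm_set1_ps(b);
      int i;
      for (i = 0; i < n; i += 4)                /* n assumed multiple of 4 */
      {
          __m128 vx = _mm_load_ps(&x[i]);       /* four words per register */
          vx = _mm_mul_ps(_mm_add_ps(vx, va), vb); /* 4 adds, 4 multiplies */
          _mm_store_ps(&x[i], vx);
      }
  }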
With processors running at 64 bits, the old i387 instructions are not available and SSE types have to be used, but more registers are available for optimisation. The 64 bit version of the benchmark at least demonstrates more than one floating point result per CPU clock cycle (linked add and multiply?).
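As an indication, the 3423 MFLOPS single CPU result in the log file below, on a Core 2 Duo measured at 2402 MHz, represents around 1.4 floating point operations per clock cycle.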
Results are provided for an Athlon 64 X2 with XP x64, Core 2 Duo processors using 32-Bit and 64-Bit Vista, a Phenom II X4 via 64-Bit Windows 7 and a Core i7 again using Windows 7. On one CPU of a 2.4 GHz Core 2 Duo, up to 3.5 GFLOPS is produced or 6.8 GFLOPS using both processors. Corresponding results for a four processor 3 GHz Phenom II are 3.7 and 14.5 GFLOPS.
The quad CPU Core i7 results are difficult to interpret. The first issue is that Hyperthreading is available, where 8 threads can be run at the same time, and this could have some impact even with purely floating point calculations. The main problem is Turbo Boost where, using a single CPU, the processor can run much faster than its rated MHz. Even four processors can run faster than the rating if not too hot. Results provided are for two 2.8 GHz i7 processors with different Turbo Boost speeds of up to 3.066 GHz and 3.466 GHz.
At 32 bits, the latest compilers refuse to obey the /arch:SSE parameter and produce only i387 floating point instructions. The ZIP file contains SSE32MFLOPS.exe, a single processor version, produced for SSE operation via an earlier compiler. Some results are given below.
The benchmark can be downloaded via OpenMPMflops.zip. No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.
The benchmarks have run time parameters to change the number of words used and repeat passes that might need adjusting for timing purposes. There is also an option to select a single processor via an Affinity setting. BAT files containing examples of run time parameters are in the ZIP file.
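As an alternative to the built in Affinity option, recent versions of Windows (Windows 7, for example) can restrict a program to chosen CPUs from the standard command prompt, the parameter being a hexadecimal mask of allowed processors, for example:

  start /affinity 1 OpenMP64MFLOPS.exe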
The CUDA graphics parallel computing benchmark
has three lots of tests where two do not involve transferring data to and/or from the host CPU's memory. The tests here can be compared with the CUDA "Data in & out" test. Below is a sample log file for the 64 Bit version on a 2.4 GHz Core 2 Duo via Vista. The second results are for a single selected CPU.
 64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:19 2009
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

  Test            4 Byte  Ops/  Repeat   Seconds  MFLOPS      First  All
                   Words  Word  Passes                      Results  Same

 Data in & out    100000     2    2500  0.194304    2573   0.929538  Yes
 Data in & out   1000000     2     250  0.193139    2589   0.992550  Yes
 Data in & out  10000000     2      25  0.415691    1203   0.999250  Yes
 Data in & out    100000     8    2500  0.312285    6404   0.957117  Yes
 Data in & out   1000000     8     250  0.335818    5956   0.995517  Yes
 Data in & out  10000000     8      25  0.473814    4221   0.999549  Yes
 Data in & out    100000    32    2500  1.488048    5376   0.890211  Yes
 Data in & out   1000000    32     250  1.891056    4230   0.988082  Yes
 Data in & out  10000000    32      25  1.185456    6748   0.998796  Yes

 64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:31 2009
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
 Single CPU Affinity 1

  Test            4 Byte  Ops/  Repeat   Seconds  MFLOPS      First  All
                   Words  Word  Passes                      Results  Same

 Data in & out    100000     2    2500  0.313641    1594   0.929538  Yes
 Data in & out   1000000     2     250  0.317088    1577   0.992550  Yes
 Data in & out  10000000     2      25  0.431107    1160   0.999250  Yes
 Data in & out    100000     8    2500  0.584243    3423   0.957117  Yes
 Data in & out   1000000     8     250  0.594728    3363   0.995517  Yes
 Data in & out  10000000     8      25  0.605958    3301   0.999549  Yes
 Data in & out    100000    32    2500  2.268676    3526   0.890211  Yes
 Data in & out   1000000    32     250  2.261049    3538   0.988082  Yes
 Data in & out  10000000    32      25  2.270906    3523   0.998796  Yes

 Hardware Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow

 Windows Information
  AMD64 processor architecture, 2 CPUs
  Windows NT Version 6.0, build 6002, Service Pack 2
  Memory 4095 MB, Free 2854 MB
  User Virtual Space 8388608 MB, Free 8388560 MB
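The MFLOPS figures can be checked from the other columns as words x operations per word x repeat passes / seconds. For the first line above, that is 100000 x 2 x 2500 / 0.194304 = 2573 MFLOPS.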
Following are results of the single processor SSE test, the 32 bit i387 OpenMP benchmark, the 64 bit SSE OpenMP version and MFLOPS obtained using CUDA. The latter are given both for tests that copy data between host RAM and graphics RAM and for graphics processor calculations without external data transfers. Systems used are a Core 2 Duo with 64-Bit Vista and an AMD Athlon 64 X2 using XP x64, followed by a Pentium 4 with 32-Bit XP and a Core 2 Duo laptop with 32-Bit Vista. Later results are for a quad core Phenom II CPU using 64-Bit Windows 7 and a much faster graphics card. Even later are for a quad core Intel i7 processor, with a top end graphics card and, again, using 64-Bit Windows 7. This processor can use Hyperthreading and appears to Windows as having eight CPUs. Latest results are for a dual core Core i5 that also has Hyperthreading.
Single, Dual and Quad CPUs - Appropriate performance gains are obvious on increasing the number of calculations per memory access. With two calculations per word there can be little gain using more than one CPU, as performance is limited by main memory speed. Some results on the AMD Athlon CPU reflect the smaller, slower L2 cache.
CPU and GPU - Particularly as the compiler used does not fully implement SSE SIMD instructions, the GPU CUDA operations can be attractively fast, the latest results showing up to 24 times faster using a GTX 480.
SSE and i387 - Again because of compiler limitations, the old i387 floating point instructions can produce comparable performance in some cases.
32 and 64 Bit SSE - Faster performance using a 64-Bit compilation could be expected, due to the availability of more registers for optimisation, but this is not always the case. Examination of actual intermediate machine code instructions can provide an explanation (see below).
Hyperthreading - This does not appear to raise maximum throughput of the four core i7 to much more than four times the single CPU rate. Using one or two threads, the processors are likely to be running at the Turbo Boost speed of 3066 MHz, but falling back to 2933 MHz with four threads (or 2800 MHz if hot), reducing relative performance.
For more details and Hyperthreading results with other benchmarks see
Quad Core 8 Thread.htm.
Core i7 930 2.8 GHz, increased by Turbo Boost up to 3.066 GHz using 1 CPU
and up to 2.933 GHz using 4 CPUs - Windows 7 64 - MFLOPS

                                                             CUDA     CUDA
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b  GeForce   No I/O
    Words  Word   1 CPU   1 CPU 4/8 CPU    1 CPU  4/8 CPU   GTX480   GTX480

   100000     2    3567    1248    4455     1574     4001      521     5554
  1000000     2    3529    1420    5433     1861     4919      819    21493
 10000000     2    2388    1364    3038     1735     3076     1014    31991
   100000     8    4655    2337    8798     3794    14581     2058    20129
  1000000     8    4642    2413    9813     4149    17080     3306    82132
 10000000     8    4453    2436    9581     4011    12457     4057   125413
   100000    32    3328    2957   12020     4324    16786     7768    52230
  1000000    32    3329    3011   12339     4436    17599    13190   254306
 10000000    32    3307    3003   12432     4418    17576    16077   425237


Phenom II X4 3.0 GHz, Windows 7 64 - MFLOPS

                                                             CUDA     CUDA
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b  GeForce   No I/O
    Words  Word   1 CPU   1 CPU   4 CPU    1 CPU    4 CPU   GTS250   GTS250

   100000     2    3552    1920    5587     1822     5613      328     3054
  1000000     2    3268    1919    5585     1870     7056      625     9672
 10000000     2    1861    1625    2993     1563     2972      714    13038
   100000     8    4535    2115    7763     3637    12653     1336    12233
  1000000     8    4341    2108    7975     3709    14518     2382    39481
 10000000     8    4141    2100    8062     3543    11273     2949    51199
   100000    32    4012    2566    9675     3652    14092     5142    36080
  1000000    32    3981    2552   10091     3663    14510     9427   108170
 10000000    32    3941    2510    9902     3633    14034    11182   135041


Core 2 Duo 2.4 GHz, Vista 64 - MFLOPS

                                                             CUDA     CUDA
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b  GeForce   No I/O
    Words  Word   1 CPU   1 CPU   2 CPU    1 CPU    2 CPU   8600GT   8600GT

   100000     2    2524    1599    2660     1594     2573      215     1770
  1000000     2    2353    1617    2957     1577     2589      342     3479
 10000000     2    1158    1180    1136     1160     1203      417     3874
   100000     8    3647    2063    3948     3423     6404      886     6931
  1000000     8    3445    2070    3624     3363     5956     1371    13250
 10000000     8    3231    2058    3962     3301     4221     1661    14281
   100000    32    2590    2653    4909     3526     5376     3329    16583
  1000000    32    2659    2658    4580     3538     4230     5019    27027
 10000000    32    2663    2649    5183     3523     6748     5975    28923


Core i5-2467M 1.6 GHz to 2.3 GHz Turbo Boost,
Dual Core + Hyperthreading, Windows 7 - MFLOPS

     Data  Ops/     SSE    i387  SSE 64b
    Words  Word   1 CPU   2 CPU    2 CPU

   100000     2    1611     975     1613
  1000000     2    2247    2100     1917
 10000000     2    1625    1603     1681
   100000     8    2829    2621     3524
  1000000     8    3248    2756     3604
 10000000     8    3458    2844     5377
   100000    32    3308    3691     4032
  1000000    32    3330    3994     4178
 10000000    32    3322    4898     5041


AMD Athlon 64 X2 2.2 GHz, XP x64 - MFLOPS

                    A64     A64     A64      A64      A64
     Data  Ops/     SSE    i387    i387  SSE 64b  SSE 64b
    Words  Word   1 CPU   1 CPU   2 CPU    1 CPU    2 CPU

   100000     2    1304    1060    1961     1114     2015
  1000000     2     659     639     812      638      817
 10000000     2     665     640     837      636      831
   100000     8    2084    1495    2922     1942     3783
  1000000     8    1853    1369    2629     1692     3058
 10000000     8    1861    1376    2701     1706     3110
   100000    32    2488    1852    3428     1731     3254
  1000000    32    2439    1813    3614     1793     3369
 10000000    32    2443    1818    3629     1774     3443


32 Bit Windows - MFLOPS

   CPU          Pentium 4     Core 2 Duo Laptop     Atom Netbook
                  P4     P4    C2D    C2D    C2D   Atom   Atom   Atom
   MHz          1900   1900   1829   1829   1829   1600   1600   1600
                XP32   XP32    V32    V32    V32   XP32   XP32   XP32

     Data  Ops/  SSE   i387    SSE   i387   i387    SSE   i387   i387
    Words  Word 1 CPU 1 CPU  1 CPU  1 CPU  2 CPU  1 CPU  No HT     HT

   100000     2   221    223   1811   1201   2063    264    175    323
  1000000     2   224    224    673    650    630    259    185    311
 10000000     2   204    206    651    668    650    258    189    331
   100000     8   835    742   2648   1558   2773    409    257    460
  1000000     8   817    699   2326   1529   2568    406    263    443
 10000000     8   764    771   2331   1508   2645    406    265    475
   100000    32  1160   1017   1935   1978   3627    457    369    679
  1000000    32  1163   1025   1970   1977   3719    456    371    679
 10000000    32  1165   1029   2015   1921   3727    456    372    677

 Single processor Atom i387 results are with Hyperthreading off and on.
Following are OpenMP benchmark results for the version compiled for 64 bit working, with performance gains shown when using multiple processors. These gains are lowest using 10M words (40 MB) with an add and a multiply for each word read, limited by RAM speed. There is generally no such limitation with 32 operations per word at any of the data sizes.
These results include those for two 2.8 GHz Core i7 CPUs that have different Turbo Boost characteristics. In this case, the i7 860 had been detuned and, based on results with 32 operations per word, single CPU tests suggest that both were running at around 3 GHz, with Core i7/Core 2 measured speed ratios similar to MHz ratios (3066/2400 = 4510/3530). The i7 860 has faster RAM, affecting tests with fewer operations per word.
64 Bit OpenMP Benchmark MFLOPS

                       Athlon 64 X2                Core 2 Duo
     Data  Ops/  SSE 64b  SSE 64b  Gain   SSE 64b  SSE 64b  Gain
    Words  Word    1 CPU    2 CPU           1 CPU    2 CPU

   100000     2     1114     2015   1.8      1594     2573   1.6
  1000000     2      638      817   1.3      1577     2589   1.6
 10000000     2      636      831   1.3      1160     1203   1.0
   100000     8     1942     3783   1.9      3423     6404   1.9
  1000000     8     1692     3058   1.8      3363     5956   1.8
 10000000     8     1706     3110   1.8      3301     4221   1.3
   100000    32     1731     3254   1.9      3526     5376   1.5
  1000000    32     1793     3369   1.9      3538     4230   1.2
 10000000    32     1774     3443   1.9      3523     6748   1.9

                        Phenom II                  Core i7 860                Core i7 930
     Data  Ops/  SSE 64b  SSE 64b  Gain   SSE 64b  SSE 64b  Gain   SSE 64b  SSE 64b  Gain
    Words  Word    1 CPU    4 CPU           1 CPU    4 CPU           1 CPU    4 CPU

   100000     2     1822     5613   3.1      1661     4263   2.6      1574     4001   2.5
  1000000     2     1870     7056   3.8      1922     5142   2.7      1861     4919   2.6
 10000000     2     1563     2972   1.9      1824     3838   2.1      1735     3076   1.8
   100000     8     3637    12653   3.5      3939    13804   3.5      3794    14581   3.8
  1000000     8     3709    14518   3.9      4251    18082   4.3      4149    17080   4.1
 10000000     8     3543    11273   3.2      4133    15079   3.6      4011    12457   3.1
   100000    32     3652    14092   3.9      4438    16299   3.7      4324    16786   3.9
  1000000    32     3663    14510   4.0      4512    18081   4.0      4436    17599   4.0
 10000000    32     3633    14034   3.9      4493    17752   4.0      4418    17576   4.0

 i7 860 2.8 GHz, Turbo Boost possible to 3.47 GHz using 1 CPU, to 2.93 GHz using 4
 i7 930 2.8 GHz, Turbo Boost possible to 3.07 GHz using 1 CPU, to 2.93 GHz using 4
The benchmarks were compiled using the /Fa option, which produces a file containing an assembly code listing. The listings show significant differences between 64 bit and 32 bit compilations, and also depending on whether the /openmp parameter is included.
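For reference, the listings can be reproduced with a command of the following form, where the source file name is assumed for illustration:

  cl /O2 /openmp /Fa openmpmflops.cpp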
The most obvious difference is when using two operations per word, where the 32 bit compilation unrolls the loop (using x[i], x[i+1], x[i+2] and x[i+3], with four times as many calculations per loop pass, as in the sketch below). This results in some much faster speeds for the 32 bit version. A further 64 bit compilation, without /openmp, also included unrolling.
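In source code terms, this unrolling of the two operations per word loop is equivalent to the following, illustrative only and ignoring any leftover elements when n is not a multiple of 4:

  for(i=0; i<n; i+=4)
  {
      x[i]   = (x[i]   + a) * b;
      x[i+1] = (x[i+1] + a) * b;
      x[i+2] = (x[i+2] + a) * b;
      x[i+3] = (x[i+3] + a) * b;
  }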
At the other extreme, where 64 bit compilation is much faster, memory accesses are reduced by using the additional registers. In the listings below, these accesses appear as instructions such as addss xmm6, DWORD PTR _g$[esp], and the extra registers as xmm8 to xmm15 (24 registers would really be needed - CUDA has more).
2 Operations Per Word

  for(i=0; i< n; i++) x[i]=(x[i]+a)*b;

 64 Bit SSE Instructions

  $LL6@triad$omp$:
  ; Line 77
     movaps  xmm0, xmm1
     add     rax, 4
     sub     rcx, 1
     addss   xmm0, DWORD PTR [rax-4]
     mulss   xmm0, xmm2
     movss   DWORD PTR [rax-4], xmm0
     jne     SHORT $LL6@triad$omp$

 32 Bit SSE Instructions

  $L56949:
  ; Line 77
     movss   xmm2, DWORD PTR [eax-8]
     addss   xmm2, xmm1
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax-8], xmm2
     movaps  xmm2, xmm1
     addss   xmm2, DWORD PTR [eax-4]
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax-4], xmm2
     movss   xmm2, DWORD PTR [eax]
     addss   xmm2, xmm1
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax], xmm2
     movss   xmm2, DWORD PTR [eax+4]
     addss   xmm2, xmm1
     mulss   xmm2, xmm0
     movss   DWORD PTR [eax+4], xmm2
     add     eax, 16
     dec     edx
     jne     SHORT $L56949

8 Operations Per Word

  for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

 64 Bit SSE Instructions

  $LL6@triadplus$:
  ; Line 69
     movss   xmm1, DWORD PTR [rcx]
     add     rcx, 4
     sub     rax, 1
     movaps  xmm2, xmm1
     movaps  xmm0, xmm1
     addss   xmm1, xmm7
     addss   xmm2, xmm3
     addss   xmm0, xmm5
     mulss   xmm1, xmm8
     mulss   xmm2, xmm4
     mulss   xmm0, xmm6
     subss   xmm2, xmm0
     addss   xmm2, xmm1
     movss   DWORD PTR [rcx-4], xmm2
     jne     SHORT $LL6@triadplus$

 32 Bit SSE Instructions

  $L56942:
  ; Line 69
     movss   xmm6, DWORD PTR [eax-8]
     movss   xmm7, DWORD PTR [eax-8]
     addss   xmm7, xmm3
     mulss   xmm7, xmm2
     addss   xmm6, xmm5
     mulss   xmm6, xmm4
     subss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax-8]
     addss   xmm7, xmm1
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   DWORD PTR [eax-8], xmm6
     movaps  xmm6, xmm5
     addss   xmm6, DWORD PTR [eax-4]
     mulss   xmm6, xmm4
     movaps  xmm7, xmm3
     addss   xmm7, DWORD PTR [eax-4]
     mulss   xmm7, xmm2
     subss   xmm6, xmm7
     movaps  xmm7, xmm1
     addss   xmm7, DWORD PTR [eax-4]
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax]
     movss   DWORD PTR [eax-4], xmm6
     movss   xmm6, DWORD PTR [eax]
     addss   xmm7, xmm3
     mulss   xmm7, xmm2
     addss   xmm6, xmm5
     mulss   xmm6, xmm4
     subss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax]
     addss   xmm7, xmm1
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax+4]
     movss   DWORD PTR [eax], xmm6
     movss   xmm6, DWORD PTR [eax+4]
     addss   xmm7, xmm3
     addss   xmm6, xmm5
     mulss   xmm7, xmm2
     mulss   xmm6, xmm4
     subss   xmm6, xmm7
     movss   xmm7, DWORD PTR [eax+4]
     addss   xmm7, xmm1
     mulss   xmm7, xmm0
     addss   xmm6, xmm7
     movss   DWORD PTR [eax+4], xmm6
     add     eax, 16
     dec     edx
     jne     $L56942

32 Operations Per Word

  for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k
                          -(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;

 64 Bit SSE Instructions

  $LL6@triadplus2:
  ; Line 61
     movss   xmm2, DWORD PTR [rbp]
     add     rbp, 4
     sub     r12, 1
     movaps  xmm0, xmm2
     movaps  xmm1, xmm2
     movaps  xmm3, xmm2
     addss   xmm0, xmm6
     addss   xmm3, xmm4
     addss   xmm1, xmm8
     mulss   xmm0, xmm7
     mulss   xmm3, xmm5
     mulss   xmm1, xmm9
     subss   xmm3, xmm0
     movaps  xmm0, xmm2
     addss   xmm3, xmm1
     addss   xmm0, xmm10
     movaps  xmm1, xmm2
     mulss   xmm0, xmm11
     subss   xmm3, xmm0
     addss   xmm1, xmm12
     movaps  xmm0, xmm2
     mulss   xmm1, xmm13
     addss   xmm3, xmm1
     addss   xmm0, xmm14
     movaps  xmm1, xmm2
     mulss   xmm0, xmm15
     addss   xmm1, DWORD PTR [rax]
     subss   xmm3, xmm0
     movaps  xmm0, xmm2
     mulss   xmm1, DWORD PTR [rcx]
     addss   xmm0, DWORD PTR [rdx]
     addss   xmm3, xmm1
     mulss   xmm0, DWORD PTR [r8]
     movaps  xmm1, xmm2
     addss   xmm1, DWORD PTR [r9]
     subss   xmm3, xmm0
     mulss   xmm1, DWORD PTR [r10]
     movaps  xmm0, xmm2
     addss   xmm0, DWORD PTR [r11]
     addss   xmm2, DWORD PTR [rdi]
     mulss   xmm0, DWORD PTR [rbx]
     mulss   xmm2, DWORD PTR [rsi]
     addss   xmm3, xmm1
     subss   xmm3, xmm0
     addss   xmm3, xmm2
     movss   DWORD PTR [rbp-4], xmm3
     jne     $LL6@triadplus2

 32 Bit SSE Instructions

  $L56934:
  ; Line 61
     movss   xmm5, DWORD PTR [edx+ecx*4]
     addss   xmm5, DWORD PTR _a$[esp]
     mulss   xmm5, DWORD PTR _b$[esp]
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _c$[esp]
     mulss   xmm6, DWORD PTR _d$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _e$[esp]
     mulss   xmm6, DWORD PTR _f$[esp]
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _g$[esp]
     mulss   xmm6, DWORD PTR _h$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _j$[esp]
     mulss   xmm6, DWORD PTR _k$[esp]
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _l$[esp]
     mulss   xmm6, DWORD PTR _m$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _o$[esp]
     mulss   xmm6, DWORD PTR _p$[esp]
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _q$[esp]
     mulss   xmm6, DWORD PTR _r$[esp]
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, DWORD PTR _s$[esp]
     mulss   xmm6, xmm4
     addss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, xmm3
     mulss   xmm6, xmm2
     subss   xmm5, xmm6
     movss   xmm6, DWORD PTR [edx+ecx*4]
     addss   xmm6, xmm1
     mulss   xmm6, xmm0
     addss   xmm5, xmm6
     movss   DWORD PTR [edx+ecx*4], xmm5
     inc     ecx
     cmp     ecx, edi
     jl      $L56934
The run time display and log files show the numeric results of the calculations; values produced using the same default parameters are shown below. There is some variation in rounding after calculations, with SSE, i387 and CUDA instructions each producing slightly different values (the i387 unit, for instance, carries out intermediate calculations at a higher precision).
  4 Byte  Ops  Repeat       SSE      i387      i387   SSE 64b   SSE 64b   SSE 64b      CUDA
   Words  /Wd  Passes     1 CPU     1 CPU     2 CPU     1 CPU     2 CPU     4 CPU    8600GT

  100000    2    2500  0.929538  0.929475  0.929475  0.929538  0.929538  0.929538  0.929538
 1000000    2     250  0.992550  0.992543  0.992543  0.992550  0.992550  0.992550  0.992550
10000000    2      25  0.999250  0.999249  0.999249  0.999250  0.999250  0.999250  0.999250
  100000    8    2500  0.957117  0.957164  0.957164  0.957117  0.957117  0.957117  0.956980
 1000000    8     250  0.995517  0.995525  0.995525  0.995517  0.995517  0.995517  0.995509
10000000    8      25  0.999549  0.999550  0.999550  0.999549  0.999549  0.999549  0.999549
  100000   32    2500  0.890211  0.890377  0.890377  0.890211  0.890211  0.890211  0.890079
 1000000   32     250  0.988082  0.988102  0.988102  0.988082  0.988082  0.988082  0.988073
10000000   32      25  0.998796  0.998799  0.998799  0.998796  0.998796  0.998796  0.998799